Abstract

As a technology to read brain states from measurable brain activities, brain
decoding is widely applied in industry and the medical sciences. Despite the
high demand in these applications for a universal decoder that can be
applied to all individuals simultaneously, large variation in brain activities
across individuals has limited the scope of many studies to the development
of individual-specific decoders. In this study, we used a deep neural network
(DNN), a nonlinear hierarchical model, to construct a subject-transfer
decoder. Our decoder is the first successful DNN-based subject-transfer
decoder. When applied to a large-scale functional magnetic resonance imaging
(fMRI) database, our DNN-based decoder achieved higher decoding accuracy
than other baseline methods, including support vector machine (SVM). In
order to analyze the knowledge acquired by this decoder, we applied
principal sensitivity analysis (PSA) to the decoder and visualized the
discriminative features that are common to all subjects in the dataset. Our PSA
successfully visualized the subject-independent features contributing to the
subject-transferability of the trained decoder.

Keywords: fMRI, Brain decoding, subject-transfer decoding, deep neural 
network (DNN), principal sensitivity analysis (PSA), brain machine 
interface (BMI) 


* Corresponding author 

Email address: koyamada-s@sys.i.kyoto-u.ac.jp (Sotetsu Koyamada)


Preprint submitted to Neural Networks SI: NN Learning in Big Data February 3, 2015 



1. Introduction 


Brain decoding is an act of decoding exogenous and/or endogenous brain
states from measurable brain activities (Haxby et al., 2001; Cox and Savoy,
2003; Kamitani and Tong, 2005; Shibata et al., 2011; Horikawa et al., 2013).
It has been attracting much attention in medical and industrial fields as a
major next-generation technology. Possible applications include brain machine
interface (BMI) (LaConte, 2011), neuro rehabilitation (Sitaram et al., 2012)
and therapy of mental disorders (Sitaram et al., 2007). A brain decoder is a
function that takes brain activities as input and returns brain states as
output, and its performance is evaluated by how well it approximates the real
association between these two entities. As such, it falls into the category of
machine learning, and it is often studied in the particular framework of
supervised learning (Lemm et al., 2011). As in any other application of
machine learning, the performance of the decoder depends heavily on the
quality and quantity of the data used for its training.

Much difficulty remains in obtaining sufficient data from a single
individual to build a reliable decoder. One can extract only so much information
from a single subject, because there is a limit to the mental and physical
stress that the subject can endure. One would therefore seek a decoder that
can be trained using big data amassed from multiple subjects, so that we
can reduce the stress per subject while granting the decoder the ability to
simultaneously decode a heterogeneous dataset, i.e., a subject-transfer decoder
(Fazli et al., 2009; Raizada and Connolly, 2012; Marquand et al., 2014). For
an ideal subject-transfer decoder, the decoding accuracy does not deteriorate
on data obtained from individuals outside the group used in its training.
However, large inter-subject variation of brain activities has long hindered
the development of such decoders. The scope of brain decoding studies up
until recent years has hence been restricted to subject-specific (tailor-made)
decoders that can only cope with data from the very subject who provided
the training dataset (e.g., Cox and Savoy (2003); Haxby et al. (2001);
Nishimoto et al. (2011); Horikawa et al. (2013)).

In this work, we present a subject-transfer decoder in the form of
a deep neural network (DNN) trained with big data, a decoder aimed at
classifying brain activities into seven cognitive task categories. For both
training and testing, we used a large fMRI dataset from the Human Connectome
Project (HCP) gathered from over 500 subjects. The application of DNN
to fMRI datasets is not new; Plis et al. (2014) used an fMRI-trained DNN to
study schizophrenia patients, and Hatakeyama et al. (2014) used still
another variation of fMRI-trained DNN to classify hand motions. Our work is
the first of its kind in using a DNN to construct a subject-transfer decoder
from big data. Our subject-transfer decoder achieved higher decoding
accuracy than the other baseline methods we tested, including support vector
machine (SVM). This is indicative of DNN’s superior generalization ability
over heterogeneous big data. Also, decoding accuracy improved monotonically
as the number of training subjects increased. Given that we are engaged
in subject-transfer decoding, this monotonic trend suggests that our
training successfully extracts more subject-independent features from larger
datasets. This also shows that the size of the dataset contributes to the
robustness of the decoder over a heterogeneous population. DNN together with
big data thus emerges as a successful new approach to subject-transfer
decoding.

In order to further assess the efficiency of our trained DNN, we applied
principal sensitivity analysis (PSA) (Koyamada et al., 2014), a recently
introduced knowledge discovery procedure, to highlight the subject-independent
features used by the decoder. By illustrating these features on a map of the
brain, we were able to make some connections between these features and the
functional connectivity reported in human fMRI studies (Raichle et al., 2007;
Taylor et al., 2009; Cole et al., 2013).


This paper is structured as follows. In the method section, we will provide 
the settings under which we trained our DNN, along with the specification 
of the dataset. We will also provide the theoretical background of the PSA. 
In the result section, we will compare the performance of our DNN against 
standard classification techniques, and show how the decoding accuracy of 
the DNN improved as the size of the training dataset increased. We will also 
provide an interpretation of the PSA in terms of functional connectivity in 
the brain. 


2. Methods 

2.1. fMRI data acquisition and preprocessing 

The Human Connectome Project (HCP) is a scientific project “to map
macroscopic human brain circuits and their relationship to behavior in a large
population of healthy adults” (Van Essen et al., 2013), and it provides one of the
largest open fMRI databases publicly available today. In this
study, we used the task-evoked fMRI data collected from 499 participants
in Quarter 1 through Quarter 6, which were preprocessed and registered by
Van Essen et al. (2013) and Glasser et al. (2013) (HCP S500 release). For more
details, see the HCP release reference manual (footnote 1).

In this section, we provide key data specifications and the preprocessing
procedure. fMRI data of 499 healthy adults were acquired with a Siemens 3T
Skyra, with TR = 720 ms, TE = 33.1 ms, flip angle 52°, FOV = 208 x 180
mm, 72 slices, and 2.0 x 2.0 mm in-plane resolution. The preprocessing that had
been applied to the fMRI data in the HCP prior to our own modification
includes removal of spatial artifacts and distortions, within-subject cross-modal
registrations, reduction of the bias field, and alignment to standard space.
In addition to these processes, we applied a voxel-wise z-score transformation
to the data and averaged the intensity over each anatomical region of
interest (aROI). The intention of the latter averaging procedure is to help the
decoder learn features that are robust against large inter-subject variability
of brain activities. aROIs were determined by the automated anatomical
labeling method (Tzourio-Mazoyer et al., 2002). In the end, the dimension
of each preprocessed fMRI scan became 116.
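The two extra preprocessing steps described above (voxel-wise z-scoring followed by averaging within each aROI) can be sketched as follows. This is a minimal illustration and not the authors' pipeline: the function name `zscore_and_average`, the choice to z-score each voxel across the scans of one run, and the toy label array are all assumptions made for the example.

```python
import numpy as np

def zscore_and_average(scans, roi_labels, n_rois):
    """Reduce voxel data to one feature per region.

    scans:      (T, V) array, T scans by V voxels (hypothetical layout).
    roi_labels: (V,) integer array assigning each voxel to a region in [0, n_rois).
    """
    # Voxel-wise z-score, computed across the scan (time) axis (an assumption).
    mean = scans.mean(axis=0)
    std = scans.std(axis=0)
    std[std == 0] = 1.0  # guard against constant voxels
    z = (scans - mean) / std
    # Average the z-scored intensity over the voxels of each region.
    features = np.zeros((scans.shape[0], n_rois))
    for r in range(n_rois):
        features[:, r] = z[:, roi_labels == r].mean(axis=1)
    return features

# Toy data: 20 scans, 12 voxels, 4 regions of 3 voxels each.
rng = np.random.default_rng(0)
scans = rng.normal(size=(20, 12))
roi_labels = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3])
features = zscore_and_average(scans, roi_labels, n_rois=4)
```

With the 116 regions of the automated anatomical labeling atlas, the same call would yield one 116-dimensional vector per scan.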

The 499 participants (subjects) in the dataset we studied were asked to
perform seven tasks related to the following categories: Emotion, Gambling,
Language, Motor, Relational, Social and Working Memory (WM). Each
subject performed each task twice, with time limits that vary across
tasks (see Table 1). Note that the number of scans conducted in the
experiment varies across different tasks. One hundred unrelated subjects completed
all seven tasks. The WM class occupied the largest proportion (20.88%) of all
scans for each subject. The experimental design of each task is summarized
below. See Barch et al. (2013) for more details.


1. Emotion: Participants were asked to match one of two simultaneously
   presented images with a target image (angry face or fearful face). This
   is a modified version of the emotion task employed in Hariri et al.
   (2002).

2. Gambling: Participants were asked to play a simple game to get
   money. See Delgado et al. (2000) for more details.

3. Language: After listening to a brief story, participants were asked to
   answer a two-alternative forced choice question about the topic of the
   story. See Binder et al. (2011) for more details.

4. Motor: Participants were requested to move one of five body parts
   (left or right finger, left or right toe, or tongue) as instructed by a
   visual cue (Buckner et al., 2011).

5. Relational: Each participant was presented with two pairs of objects,
   and was subsequently asked to answer a second-order question
   regarding the shapes/textures of the objects.

6. Social: Participants were presented with a movie clip, and were asked
   to decide whether the movements of the objects in the clip are related
   to each other in some way. The movie clips were originally prepared
   by Castelli et al. (2000) and Wheatley et al. (2007).

7. WM (Working Memory): Participants were asked to complete
   two-back working memory tasks and zero-back tasks with four different
   types of image stimuli (places, tools, faces or body parts).

1 www.humanconnectome.org/documentation/S500


Table 1: Number of scans per session and its duration (min:sec)

            Emotion  Gambling  Language  Motor  Relational  Social  WM
Scans       176      253       316       284    232         274     405
Duration    2:16     3:12      3:57      3:34   2:56        3:27    5:01


2.2. Deep neural networks 

We trained a deep neural network (DNN) with the input being the fMRI
signals over aROIs and the output being their labeled task classes, i.e., the
category of cognitive task performed by the participants. Prior to the training
step, all fMRI scans were categorized into seven task classes, completely
disregarding the time order. The weight parameters of the DNN were then
trained to maximize the probability of successfully classifying the fMRI scans
into the seven task categories (Fig. 1).

Figure 1: We trained DNNs with the input being the fMRI scans and the output being
their labeled task classes.

We trained feed-forward neural networks (i.e., DNNs) with L hidden
layers by stochastic gradient descent (SGD) with dropout (Hinton et al.,
2012). The internal potential of the i-th unit in the l-th hidden layer,
$a_i^{(l)}$ ($l = 1, \dots, L$), is given as a weighted summation of its inputs:

$$a_i^{(l)} = \sum_{j=1}^{n_{l-1}} w_{ij}^{(l)} z_j^{(l-1)} + b_i^{(l)}, \tag{1}$$

where $w_{ij}^{(l)}$ and $b_i^{(l)}$ are a weight and a bias, respectively. Here, $n_l$ is the number
of units in the l-th hidden layer, which was set at $n_l = 500$ for any $l > 0$. We
define $z^{(0)}$ as the input vector $x$ to the network. This forces $n_0$ to be equal
to the input’s dimensionality $d$ (= 116). We denote $z^{(l)} = (z_1^{(l)}, \dots, z_{n_l}^{(l)})$ as
the outputs of the l-th hidden layer. These outputs are given by applying a
nonlinear activation function h to the internal potential as

$$z_i^{(l)} = h\left(a_i^{(l)}\right). \tag{2}$$


Here ReLU (Jarrett et al., 2009), the piecewise linear function max(0, x), was
used for the activation function h. ReLU has a couple of advantages as an
activation function: its piecewise linearity reduces the computational
cost of calculating its derivative, and its non-saturating character in the
positive domain prevents the learning algorithm from stalling because of the
vanishing gradients that affect saturating activation functions. The last
hidden layer was connected to the softmax (output) layer, so that the output
from the k-th unit of the output layer can be interpreted as the posterior
probability of class k, given by


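$$P(Y = k \mid x, W) = \frac{\exp\left( \sum_{j=1}^{n_L} w_{kj}^{(L+1)} z_j^{(L)} + b_k^{(L+1)} \right)}{\sum_{k'=1}^{K} \exp\left( \sum_{j=1}^{n_L} w_{k'j}^{(L+1)} z_j^{(L)} + b_{k'}^{(L+1)} \right)}, \tag{3}$$

where K (= 7) is the number of classes, and W denotes all the parameters
(weights and biases) of the whole network. Here, Y is a random variable
signifying the class to which x belongs.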
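The forward computation of equations (1) to (3) can be written compactly. The sketch below is illustrative only: the function names are hypothetical, and all layers (including the softmax layer) are filled with small random weights purely so that the example runs; it does not reproduce the trained decoder.

```python
import numpy as np

def relu(a):
    # Activation h: piecewise linear max(0, a).
    return np.maximum(0.0, a)

def forward(x, weights, biases):
    """Return the class posteriors for one input vector x.

    weights[l], biases[l] hold the parameters of layer l + 1;
    the last pair belongs to the softmax (output) layer.
    """
    z = x
    for W, b in zip(weights[:-1], biases[:-1]):
        z = relu(W @ z + b)            # equations (1) and (2)
    logits = weights[-1] @ z + biases[-1]
    logits = logits - logits.max()     # stabilizes exp; softmax is unchanged
    p = np.exp(logits)
    return p / p.sum()                 # equation (3)

# A 116-500-500-7 network with illustrative random parameters.
rng = np.random.default_rng(0)
sizes = [116, 500, 500, 7]
weights = [rng.normal(0.0, 0.01, size=(n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n_out) for n_out in sizes[1:]]
posterior = forward(rng.normal(size=116), weights, biases)
```

The returned vector has one entry per task class and sums to one, so its largest entry gives the decoded class.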
We used the negative log-likelihood as the cost function of the learning,

$$L(W) = -\sum_{n=1}^{N} \log P(Y_n = t_n \mid x_n, W), \tag{4}$$

where $\{(x_1, t_1), \dots, (x_N, t_N)\}$ is the given dataset and $t_n \in \{1, \dots, K\}$
denotes a class label. To minimize the above cost function, we used minibatch
stochastic gradient descent (MSGD), in which an SGD update was performed
every 100 samples:

$$W_t = W_{t-1} - \eta_t \left. \frac{\partial L'(W)}{\partial W} \right|_{W = W_{t-1}}, \tag{5}$$

where $L'$ is the cost function for the subset of 100 samples in the
minibatch, and $\eta_t$ is the learning rate. We adopted a constant rate
$\eta_t = \eta_0$, and this value was set to whichever of {0.1, 1.0} yielded the better
result for the validation dataset (see Section 2.3). Each weight of the hidden layers
was initialized to a small value randomly sampled from a zero-mean normal
distribution with a standard deviation of 0.01, and the weights of the softmax layer
and the biases were initialized to zero. Early stopping was also adopted: if the
decoding accuracy for the validation dataset did not increase for 100 learning
epochs, learning was terminated.
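As an illustration of the minibatch update, the sketch below trains a softmax regression (the zero-hidden-layer case) on toy data. It rests on several assumptions that are not taken from the text: the minibatch gradient is averaged rather than summed (a constant factor that can be absorbed into the learning rate), early stopping is omitted, and the function names and toy clusters are hypothetical.

```python
import numpy as np

def nll_and_grad(W, b, X, t):
    """Mean negative log-likelihood of softmax regression and its gradients."""
    logits = X @ W.T + b
    logits -= logits.max(axis=1, keepdims=True)
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    n = X.shape[0]
    loss = -log_p[np.arange(n), t].mean()
    delta = np.exp(log_p)
    delta[np.arange(n), t] -= 1.0      # gradient w.r.t. logits, before the 1/n factor
    return loss, delta.T @ X / n, delta.mean(axis=0)

def train_msgd(X, t, n_classes, eta=0.1, batch_size=100, epochs=20, seed=0):
    rng = np.random.default_rng(seed)
    W = np.zeros((n_classes, X.shape[1]))   # softmax weights start at zero
    b = np.zeros(n_classes)
    for _ in range(epochs):
        order = rng.permutation(X.shape[0])
        for start in range(0, X.shape[0], batch_size):
            idx = order[start:start + batch_size]
            _, gW, gb = nll_and_grad(W, b, X[idx], t[idx])
            W -= eta * gW                    # one update per 100 samples
            b -= eta * gb
    return W, b

# Toy problem: three well-separated clusters in two dimensions.
rng = np.random.default_rng(1)
t = rng.integers(0, 3, size=600)
centers = np.array([[2.0, 0.0], [-2.0, 0.0], [0.0, 2.0]])
X = centers[t] + rng.normal(scale=0.5, size=(600, 2))
loss_before, _, _ = nll_and_grad(np.zeros((3, 2)), np.zeros(3), X, t)
W, b = train_msgd(X, t, n_classes=3)
loss_after, _, _ = nll_and_grad(W, b, X, t)
```

Before training, every class has probability 1/3, so the mean loss equals log 3; the updates then lower it.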


To further avoid over-fitting, we used the dropout technique (Hinton
et al., 2012). During training, the activity $z_i^{(l)}$ was randomly replaced
by 0 with probability p. We set p = 0.5 for hidden units and p = 0.2 for inputs.
This dropout procedure acts as a regularizer and is expected to
prevent the decoder from acquiring subject-specific features. When testing
the trained neural network, all the nodes were activated, and their weights
were multiplied by 1 - p. This makes the mean activity level of each
network element consistent between the training phase and the test phase.
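The relation between the training-time masking and the test-time scaling by 1 - p can be checked numerically. The following is a small sketch with hypothetical helper names, not the authors' code.

```python
import numpy as np

def dropout_train(z, p, rng):
    """Training phase: zero each activity independently with probability p."""
    mask = rng.random(z.shape) >= p
    return z * mask

def scale_weights_for_test(W, p):
    """Test phase: keep all units and scale their outgoing weights by (1 - p)."""
    return W * (1.0 - p)

rng = np.random.default_rng(0)
p = 0.5
z = np.ones(100_000)
kept_fraction = dropout_train(z, p, rng).mean()   # close to 1 - p

W = np.full((3, 4), 2.0)
W_test = scale_weights_for_test(W, p)             # every entry becomes 1.0
```

Because a unit survives with probability 1 - p during training, scaling the weights by the same factor at test time keeps the expected input to the next layer unchanged.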


2.3. Subject-transfer decoding 

To examine the subject-transferability of our decoder’s architecture, we
selected one hundred individuals from all 499 subjects who (1) are unrelated to
each other and (2) successfully completed all seven cognitive tasks twice. Let
D be the dataset of these 100 subjects. We then performed a
leave-10-subjects-out (or 10-fold) cross validation on the dataset D. To be more
specific, D was partitioned into a test dataset of 10 subjects, a validation dataset of
10 subjects, and a training dataset of 80 subjects without any overlap. We
trained our DNN with the training dataset, while using the validation dataset
to determine the hyper-parameters and to perform early stopping. We then
tested the decoding accuracy of the trained DNNs on the test dataset. We
repeated this cycle 10 times, choosing different test and validation datasets
in each iteration. In order to examine how the size of the dataset influences
the decoding accuracy, we conducted the above experiments with different
sizes of training dataset (10 to 80 subjects) without changing the test dataset
and the validation dataset.
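One way to generate such subject-level splits is sketched below. The text does not say how the validation subjects were chosen in each iteration, so using the next block of ten subjects is an assumption, as are the function name and the requirement that the number of subjects be a multiple of the fold size.

```python
import numpy as np

def subject_folds(n_subjects=100, fold_size=10, n_train=80, seed=0):
    """Yield (train, validation, test) subject indices for each fold."""
    rng = np.random.default_rng(seed)
    blocks = rng.permutation(n_subjects).reshape(-1, fold_size)
    n_folds = len(blocks)
    for k in range(n_folds):
        v = (k + 1) % n_folds              # assumption: next block is the validation set
        rest = np.concatenate([blocks[j] for j in range(n_folds) if j not in (k, v)])
        # Smaller training sets reuse the same test and validation subjects.
        yield rest[:n_train], blocks[v], blocks[k]

folds = list(subject_folds())
```

Splitting by subject rather than by scan is what makes the test accuracy a measure of transfer to unseen individuals; passing a smaller `n_train` reproduces the reduced-training-set setting.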


2.4. Analysis for trained decoder

The purpose of the brain decoder here is to classify the brain activities
into seven categories. The result of the classification itself, however, does not
necessarily give information about the neuroscientific bases behind the classification.
One might therefore wish to investigate the decoder in an attempt to learn
the signature that characterizes each target class. This approach relies on the
philosophy of “knowledge discovery,” and one may interpret these signatures
acquired by the decoder as the decoder’s knowledge. Any association between
the knowledge of the brain decoder and the knowledge of neuroscience
can help assess why the decoder performs well or poorly, and in
some cases help understand the neural bases useful in brain decoding. In the
case of linear decoders, weight visualization is often used for this purpose
(Miyawaki et al., 2008; Abraham et al., 2014). Such visualization can be
inappropriate for decoders with a nonlinear and hierarchical architecture like
DNN because the middle layers mask the direct relation between the input
and output. A well known alternative to weight visualization in such cases
is sensitivity analysis (Zurada et al., 1994, 1997; Kjems et al., 2002),
which computes the expected sensitivity of the classifier’s output (posterior
probability of the successful classification) with respect to perturbations
in the input. In this study, we applied principal sensitivity analysis (PSA)
introduced by the authors (Koyamada et al., 2014), a PCA-like extension of
sensitivity analysis, to our DNN-based decoder. PSA distinguishes itself
from ordinary sensitivity analysis in that it can identify the direction in
the input space to which the classifier is most sensitive. It can also decompose
the input space into classifier-sensitive subspaces and rank them in order
of sensitivity. In the next two subsections, we briefly describe
sensitivity analysis and PSA.





















2.4.1. Sensitivity analysis

Let $f_k(x) : \mathbb{R}^d \to \mathbb{R}$ be the logarithm of the output from the k-th unit in
the final layer, namely

$$f_k(x) := \log P(Y = k \mid x, W), \tag{6}$$

where W denotes the parameters of the trained decoder. For simplicity, we
omit the index k in the rest of this section. The sensitivity of f with respect to
the i-th input feature is defined by

$$s_i := E_q\left[ \left( \frac{\partial f(x)}{\partial x_i} \right)^2 \right], \tag{7}$$

where q is the true input distribution. In the actual implementation, the
expectation (7) is computed with respect to the empirical distribution of the test
dataset. Kjems et al. (2002) defined the vector

$$s := (s_1, \dots, s_d) \tag{8}$$

of these values as the sensitivity map over the set of input features. This
sensitivity map gives a measure of the degree of importance that the
classifier assigns to each input.
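A sensitivity map can be estimated for any scalar function f by averaging squared partial derivatives over samples. The sketch below assumes the squared-derivative form of the sensitivity, which is the form consistent with the matrix K used in Section 2.4.2, and it uses central finite differences instead of backpropagated gradients to stay self-contained. The function names and the toy f are hypothetical.

```python
import numpy as np

def numerical_gradient(f, x, h=1e-5):
    """Central-difference estimate of the gradient of a scalar function f at x."""
    grad = np.zeros_like(x, dtype=float)
    for i in range(x.size):
        e = np.zeros_like(x, dtype=float)
        e[i] = h
        grad[i] = (f(x + e) - f(x - e)) / (2.0 * h)
    return grad

def sensitivity_map(f, X):
    """Empirical expectation of the squared partial derivatives over the rows of X."""
    grads = np.array([numerical_gradient(f, x) for x in X])
    return (grads ** 2).mean(axis=0)

# Toy function whose gradient is (3, -1, 0) everywhere.
f = lambda x: 3.0 * x[0] - 1.0 * x[1]
X = np.random.default_rng(0).normal(size=(50, 3))
s = sensitivity_map(f, X)
```

For this linear toy function the map is close to (9, 1, 0): the third input, which f ignores, receives zero sensitivity.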


2.4.2. Principal Sensitivity Analysis (PSA)

The purpose of the PSA is to compute the direction v for which f is
most sensitive in the input space. This amounts to solving the following
optimization problem about v:

$$\begin{aligned} &\text{maximize} && s(v) \\ &\text{subject to} && v^T v = 1, \end{aligned} \tag{9}$$

where s(v) is the sensitivity of f for the direction v, given by

$$s(v) := E_q\left[ \left\| \nabla_v f(x) \right\|_2^2 \right], \tag{10}$$

where $\|\cdot\|_2$ denotes the L2-norm. Recall that the directional derivative is
defined by

$$\nabla_v f(x) = \sum_{i=1}^{d} v_i \frac{\partial}{\partial x_i} f(x). \tag{11}$$

Because we can rewrite s(v) as $v^T K v$, where $K := E_q\left[ \nabla f(x) \nabla f(x)^T \right]$, the
optimization problem (9) is equivalent to

$$\begin{aligned} &\text{maximize} && v^T K v \\ &\text{subject to} && v^T v = 1. \end{aligned} \tag{12}$$

The solution to this problem is simply the maximal eigenvector $\pm v^*$ of K.
Koyamada et al. (2014) defined this vector as the principal sensitivity map
(PSM) over the space of input features. The magnitude of $v_i$ represents the
extent to which f is sensitive to the i-th input feature, and the sign of $v_i$ tells
us the relative direction in which the input feature influences f. Recall that,
if the positive semidefinite matrix K is replaced by the covariance $E_q[x x^T]$, where
x is the centered random variable, the optimization problem (12) can be seen
as the problem of solving for the first principal component of the ordinary
PCA. Since the k-th component of the ordinary PCA is the k-th dominant
eigenvector of the covariance matrix, the k-th dominant eigenvector of K can
be called the k-th principal sensitivity map (k-th PSM). These
sub-principal sensitivity maps grant us access to even richer information that
underlies the dataset through the classifier.
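Given per-sample gradients of f, the PSMs follow from an eigendecomposition of the empirical K. The sketch below uses synthetic gradients concentrated along a known direction; the function name and the toy data are assumptions for illustration only.

```python
import numpy as np

def principal_sensitivity_maps(grads, n_maps=3):
    """grads: (N, d) gradients of f at N samples.

    Returns the n_maps largest eigenvalues of K and the matching
    unit-norm eigenvectors (one PSM per row), in descending order.
    """
    K = grads.T @ grads / grads.shape[0]      # empirical E_q[grad f grad f^T]
    eigvals, eigvecs = np.linalg.eigh(K)      # K is symmetric; eigenvalues ascend
    order = np.argsort(eigvals)[::-1][:n_maps]
    return eigvals[order], eigvecs[:, order].T

# Synthetic gradients that vary mostly along one known direction.
rng = np.random.default_rng(0)
direction = np.array([3.0, 4.0, 0.0]) / 5.0
grads = rng.normal(size=(200, 1)) * direction + 0.01 * rng.normal(size=(200, 3))
eigvals, psms = principal_sensitivity_maps(grads)
```

The first PSM recovers the planted direction up to sign, which mirrors the sign ambiguity of the solution noted in the text.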


3. Results 

First, we compared the decoding accuracy of the DNNs with those of
other baseline methods using the dataset D (see Section 2.3). We trained
three neural networks with one, two and three hidden layers, each followed by
a logistic regression (softmax) output layer. The baseline methods investigated in this
study include logistic (softmax) regression, which corresponds to a 0-hidden
layer neural network, and SVMs with linear kernel and RBF kernel (see
Appendix A for the specification of these baseline methods). The mean
decoding accuracy and its standard deviation in the leave-10-subjects-out
cross validation are summarized in Table 2. The decoding accuracies of the
DNN decoders not only exceeded the prior chance level (the true fraction
of the largest class, 20.88%), but were also higher than those of the other
baseline methods. In particular, the DNN with two hidden layers
exhibited the best decoding accuracy of 50.74%, which was significantly higher
than that of the RBF-kernel SVM (p < 0.01, one-sided Welch test).
The linear methods, i.e., the logistic regression and the linear-kernel SVM, showed poor
decoding accuracies that are comparable to the chance level. These results
clearly show the advantage of nonlinear decoders over linear decoders in the
subject-transfer setting, and suggest that the DNN may be more effective in
extracting subject-independent features from big data than the other baseline
methods.

Table 2: Subject-transfer decoding performance

Method                   Architecture        Mean accuracy [%] ± s.d.
Logistic regression      116-7               20.81 ± 0.15
Support vector machine   Linear kernel       20.87 ± 0.01
Support vector machine   RBF kernel          47.97 ± 1.57
Neural network           116-500-7           48.94 ± 1.15
Neural network           116-500-500-7       50.74 ± 1.25
Neural network           116-500-500-500-7   50.57 ± 1.31
Prior chance level                           20.88
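For readers who want to see what goes into the Welch test mentioned above, the statistic and its approximate degrees of freedom can be computed from summary values. The sketch assumes that the reported s.d. values are sample standard deviations over the ten cross-validation iterations and that the two sets of accuracies are treated as independent samples; the text confirms neither, so the numbers this produces are illustrative rather than a reproduction of the reported p-value.

```python
import math

def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    v1, v2 = sd1 ** 2 / n1, sd2 ** 2 / n2
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

# Two-hidden-layer DNN versus RBF-kernel SVM, assuming n = 10 folds each.
t_stat, df = welch_t(50.74, 1.25, 10, 47.97, 1.57, 10)
```

Welch's variant is the natural choice here because it does not require the two methods to have equal variances across folds.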

Second, we investigated how the decoder’s performance changes with the
size of the training dataset. We trained the decoder with various sizes of training
dataset, and plotted the decoding accuracy against the dataset size. In this
set of experiments, we employed the DNN with two hidden layers (L = 2),
which showed better decoding accuracy than the L = 1 and L = 3 versions
over the dataset D. To evaluate the performance of a DNN decoder trained
with a training set of M subjects, we used the following cross validation
procedure. The setup of the cross validation is basically the same as the one
explained in Section 2.3. At each iteration of the cross validation procedure,
we selected 10 subjects for the test set, 10 subjects for the validation set,
and M subjects for the training set from the dataset D without any overlap.
We repeated this process 10 times, selecting 10 entirely new subjects for
the test set at each iteration. We conducted this procedure
for M = 10 to 80. In order to check the asymptotic trend of the decoding
accuracy, we examined the M = 479 case as well, in which all of the 499
subjects registered in the S500 release were used. The results are displayed
in Fig. 2. As the number of subjects in the training dataset increased from 10
to 80, the performance of the DNN decoder also increased, as expected. The
performance was best at M = 479. We attribute this trend to the positive
relationship between the size of the training dataset and the reliability of
the subject-independent features captured by the DNN decoder. This result
also implies that our DNN-based subject-transfer decoding would become
increasingly practical if we gain access to brain signal databases
gathered from even larger sets of subjects. DNN together with big data thus
proves to be an effective approach in subject-transfer decoding.


[Figure 2 plot: DNN (2 hidden layers); horizontal axis: log(# of training subjects), with ticks from 10 to 80, 100 and 479.]

Figure 2: Mean decoding accuracy plotted against the number of training subjects (M)
in log scale. Error bars indicate s.d. across ten cross validation iterations.


In order to extract the discriminative features in brain activities captured
by the brain decoder trained with the large-scale database, we conducted
principal sensitivity analysis (PSA). PSA is different from the standard sensitivity
analysis in that it quantifies the combinations of aROIs used in the decoder’s
classification, whereas the standard sensitivity analysis computes the
independent contribution of each aROI to the decoder’s decision. The PSA is
more suited for our purpose because there is strong evidence of functional
connectivity among aROIs (Buckner et al., 2013; Cole et al., 2014), and
classifiers with high decoding accuracy are likely to capture coactivation patterns
of aROIs. The expectation in equation (10) depends on the distribution q
over the sample space. In order to best approximate the true distribution
q, we used the empirical distribution over the dataset of all subjects except
the subjects used in D. Fig. 3(a) shows the first, second and third PSMs for
Emotion and Motor classes (see footnote 2). The set of PSMs presented in Fig. 3(a) is one
of the 10 variants of the sets of PSMs obtained in the leave-10-subjects-out
cross validation over the dataset D. These variants did not exhibit significant
variation. PSMs are superior to standard sensitivity maps in that they can
describe the aROIs which act oppositely in characterizing the class. Any pair
of anatomical regions with different color assignments in Fig. 3 contributes to
the classifier’s decision in opposite directions. Our PSA seems to imply that
the information learned by our DNN-based subject-transferable decoder has
some correlation with the existing knowledge of functional connectivity
supported in neuroscience. For instance, in the second PSM of the Motor class and the
first PSM of the Social class (Fig. 3), we can identify two sets of functional
connectivity established in previous works, namely the fronto-parietal network
(Cole et al., 2013) and the salience network (Taylor et al., 2009). In the second
PSM of the Social class (Fig. 3), we can also find a component of the default
mode network (Raichle and Snyder, 2007; Raichle et al., 2001) in the left
hemisphere. In addition, to quantify the similarity of PSMs, we calculated
the absolute cosine similarity for each pair of the PSMs and aligned the maps
by hierarchical clustering (Fig. 3(b)). For any pair of PSMs $(v_1, v_2)$, the
absolute cosine similarity was calculated as $|\langle v_1, v_2 \rangle|$, where $\langle \cdot, \cdot \rangle$ is the inner
product; no further normalization is needed because $\|v\|_2 = 1$ for each PSM v. In the
similarity matrix, we can confirm two large clusters, consisting mainly of first and second PSMs,
respectively. This implies that the functional connectivity interpretation
that we made above for the first and the second PSMs of the Social class and
the first PSM of the Motor class applies to all the other PSMs sharing
the same cluster memberships (see also Appendix B Fig. B.4). On the other
hand, the sub-principal PSMs, such as the third PSMs of the Emotion and Motor
classes, showed task-specific features. Finally, note that many of our PSMs
span large brain regions. This suggests that the subject-independent
features that we extracted from the large fMRI database in our deep learning
procedure are brain-wide networks that are either specific to particular
tasks or common to all tasks.

2 See Appendix B Fig. B.4 for the PSMs of the other classes.
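The absolute cosine similarity between maps takes only a few lines of NumPy. The sketch below uses a hypothetical helper and toy vectors, and normalizes explicitly so that it also works for vectors that are not unit length; the hierarchical clustering step is left out because the text does not state which linkage criterion was used.

```python
import numpy as np

def abs_cosine_similarity(maps):
    """maps: (M, d) array, one map per row. Returns the (M, M) matrix of |cos|."""
    unit = maps / np.linalg.norm(maps, axis=1, keepdims=True)
    return np.abs(unit @ unit.T)

# Toy maps: rows 0 and 1 differ only in sign, row 2 is orthogonal to both.
maps = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
S = abs_cosine_similarity(maps)
```

Taking the absolute value makes the measure insensitive to the arbitrary sign of each eigenvector, so rows 0 and 1 above are treated as identical maps.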


[Figure 3(b): similarity matrix with color scale from 0.0 to 1.0. Row labels in dendrogram leaf order: 3rd PSM (Language), 3rd PSM (Gambling), 3rd PSM (Motor), 3rd PSM (WM), 2nd PSM (Social), 3rd PSM (Relational), 3rd PSM (Emotion), 3rd PSM (Social), 2nd PSM (WM), 2nd PSM (Relational), 2nd PSM (Gambling), 2nd PSM (Emotion), 1st PSM (Motor), 1st PSM (Language), 2nd PSM (Motor), 1st PSM (Emotion), 1st PSM (WM), 1st PSM (Social), 1st PSM (Relational), 1st PSM (Gambling), 2nd PSM (Language).]

Figure 3: (a) We calculated the first, second and third PSMs for each class. The PSMs of
Motor and Social classes are shown here (see Appendix B Fig. B.4 for the other classes). We
recall that we intentionally omitted the class index in the formulation (12) for simplicity,
and that PSMs are actually meant to be computed for each class separately. In the
visualization of each PSM, we exclusively colored the ROIs with values that are at least
one s.d. away from the mean. Red and blue indicate different signs in the PSM values,
i.e., the corresponding aROIs act oppositely in characterizing the class. (b) Similarity
between maps was evaluated by cosine similarity. We calculated the absolute value of
cosine similarity (for definition, see the text) between all pairs of the first, second and
third PSMs of the seven classes. Based on these similarity values, we applied hierarchical
clustering to the set of all the computed PSMs. The similarity matrix above is based on
the leaf order in the dendrogram (attached to the left side of the matrix) obtained by the
hierarchical clustering. Intensity of the color indicates the degree of similarity.


4. Conclusion

In this study, we applied deep learning to a large fMRI database to classify
brain activities into seven cognitive task categories. In particular, we constructed
a DNN-based brain decoder aimed at classifying the brain activities of any
arbitrary individual. The strength of DNN is its high ability to capture
nonlinear, high dimensional features from big data. In fact, the decoding accuracy
of our DNN was superior to those of other baseline methods, including SVMs
and logistic regression. The high performance of our DNN over the dataset
acquired from a large population is indicative of the ability of our decoder
to capture nonlinear discriminative features that are common to all subjects.
Interestingly, when we visualized these universal discriminative features in
the form of PSMs, we were able to find nontrivial associations between
the features and functionally connected networks recognized in neuroscience.
This observation suggests that the functional connectivity common to all
subjects plays an important role in characterizing task-specific brain
activities. Furthermore, our DNN-based decoder leaves some room for further
improvement. For instance, the training of DNN and its final performance
depend largely on the initial parameters. One may use our DNN as an initial
model to construct a decoder that is highly tuned for a specific individual.
One can expect such a decoder to utilize both subject-independent features
and subject-specific features. As another interesting extension to this
research, we may add demographic features like age and sex to the model. The
flexibility of the DNN architecture allows for numerous modifications of the base
model. Also, the relationship between the number of training subjects and
the decoding accuracy was positive. With a bigger and more multimodal
dataset and a DNN with more sophisticated architecture, one might be able
to capture richer signatures that are otherwise difficult to uncover, and such
signatures might inspire novel insight into the neural bases recruited in
different situations in the brain. Our results hint that advanced machine learning
techniques will continue to grow in importance in the coming era of
computational neuroscience that finds its basis in the statistics of heterogeneous big
data.
Appendix A. Specifications of baseline methods

Each version of SVM consists of seven one-versus-the-rest classifiers. For
the SVMs, we used scikit-learn (Pedregosa et al., 2011). Hyper-parameters
were chosen to maximize the decoding accuracy over the validation dataset
(see Section 2.3); we heuristically prepared nine sets of hyper-parameters for
each method, and adopted the one that achieved the best decoding accuracy
for the validation dataset. The hyper-parameter for the logistic regression
was the learning rate in the MSGD. The hyper-parameter for the
linear-kernel SVM was the regularization strength C used in scikit-learn. The
RBF-kernel SVM depended on a pair of hyper-parameters, the
regularization strength C and the kernel width γ. We considered three values each
for C and γ, and examined all nine pairs for the best choice.